ABSTRACT


The past 3 years of work in NLP have been characterized by the 
development and deployment of ever larger language models, es- 
pecially for English. BERT, its variants, GPT-2/3, and others, most 
recently Switch-C, have pushed the boundaries of the possible both 
through architectural innovations and through sheer size. Using 
these pretrained models and the methodology of fine-tuning them 
for specific tasks, researchers have extended the state of the art 
on a wide array of tasks as measured by leaderboards on specific 
benchmarks for English. In this paper, we take a step back and ask: 
How big is too big? What are the possible risks associated with this 
technology and what paths are available for mitigating those risks? 
We provide recommendations including weighing the environmen- 
tal and financial costs first, investing resources into curating and 
carefully documenting datasets rather than ingesting everything on 
the web, carrying out pre-development exercises evaluating how 
the planned approach fits into research and development goals and 
supports stakeholder values, and encouraging research directions 
beyond ever larger language models. 
1 INTRODUCTION


One of the biggest trends in natural language processing (NLP) has 
been the increasing size of language models (LMs) as measured 
by the number of parameters and size of training data. Since 2018 
alone, we have seen the emergence of BERT and its variants [39,
70, 74, 113, 146], GPT-2 [106], T-NLG [112], GPT-3 [25], and most 
recently Switch-C [43], with institutions seemingly competing to 
produce ever larger LMs. While investigating properties of LMs and 
how they change with size holds scientific interest, and large LMs 
have shown improvements on various tasks (§2), we ask whether 
enough thought has been put into the potential risks associated 
with developing them and strategies to mitigate these risks. 

We first consider environmental risks. Echoing a line of recent 
work outlining the environmental and financial costs of deep learn- 
ing systems [129], we encourage the research community to priori- 
tize these impacts. One way this can be done is by reporting costs 
and evaluating works based on the amount of resources they con- 
sume [57]. As we outline in §3, increasing the environmental and 
financial costs of these models doubly punishes marginalized com- 
munities that are least likely to benefit from the progress achieved 
by large LMs and most likely to be harmed by negative environmental
consequences of their resource consumption. At the scale we
are discussing (outlined in §2), the first consideration should be the 
environmental cost. 

Just as environmental impact scales with model size, so does 
the difficulty of understanding what is in the training data. In §4, 
we discuss how large datasets based on texts from the Internet 
overrepresent hegemonic viewpoints and encode biases potentially 
damaging to marginalized populations. In collecting ever larger 
datasets we risk incurring documentation debt. We recommend 
mitigating these risks by budgeting for curation and documentation 
at the start of a project and only creating datasets as large as can 
be sufficiently documented. 

As argued by Bender and Koller [14], it is important to under- 
stand the limitations of LMs and put their success in context. This 
not only helps reduce hype which can mislead the public and re- 
searchers themselves regarding the capabilities of these LMs, but 
might encourage new research directions that do not necessarily 
depend on having larger LMs. As we discuss in §5, LMs are not 
performing natural language understanding (NLU), and only have 
success in tasks that can be approached by manipulating linguis- 
tic form [14]. Focusing on state-of-the-art results on leaderboards 
without encouraging deeper understanding of the mechanism by 
which they are achieved can cause misleading results as shown 
in [21, 93] and direct resources away from efforts that would facilitate
long-term progress towards natural language understanding,
without using unfathomable training data. 

Furthermore, the tendency of human interlocutors to impute 
meaning where there is none can mislead both NLP researchers 
and the general public into taking synthetic text as meaningful. 
Combined with the ability of LMs to pick up on both subtle biases 
and overtly abusive language patterns in training data, this leads 
to risks of harms, including encountering derogatory language and 
experiencing discrimination at the hands of others who reproduce 
racist, sexist, ableist, extremist or other harmful ideologies rein- 
forced through interactions with synthetic language. We explore 
these potential harms in §6 and potential paths forward in §7. 

We hope that a critical overview of the risks of relying on ever- 
increasing size of LMs as the primary driver of increased perfor- 
mance of language technology can facilitate a reallocation of efforts 
towards approaches that avoid some of these risks while still reap- 
ing the benefits of improvements to language technology. 


2 BACKGROUND 


Similar to [14], we understand the term language model (LM) to 
refer to systems which are trained on string prediction tasks: that is, 
predicting the likelihood of a token (character, word or string) given 
either its preceding context or (in bidirectional and masked LMs) 
its surrounding context. Such systems are unsupervised and, when
deployed, take a text as input, commonly outputting scores or string
predictions. The idea was initially proposed by Shannon in 1949 [117]; some of
the earliest implemented LMs date to the early 1980s and were used
as components in systems for automatic speech recognition (ASR), 
machine translation (MT), document classification, and more [111]. 
In this section, we provide a brief overview of the general trend of 
language modeling in recent years. For a more in-depth survey of 
pretrained LMs, see [105]. 
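
To make the definition above concrete, the following is a minimal sketch of an LM in this sense: a bigram model that estimates the likelihood of a token given only its preceding token. The toy corpus and the function names (`train_bigram_lm`, `most_likely_next`) are our own illustrative assumptions, not taken from any of the systems cited here, and real LMs condition on far longer contexts.

```python
from collections import Counter, defaultdict


def train_bigram_lm(tokens):
    """Estimate P(next | previous) from raw bigram counts (no smoothing)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: c / sum(following.values()) for tok, c in following.items()}
        for prev, following in counts.items()
    }


def most_likely_next(lm, prev):
    """Return the highest-probability continuation of `prev`."""
    return max(lm[prev], key=lm[prev].get)


# Hypothetical toy corpus, used only for illustration.
corpus = "the cat sat on the mat the cat ran".split()
lm = train_bigram_lm(corpus)

print(lm["the"])                    # distribution over tokens following "the"
print(most_likely_next(lm, "the"))  # the string prediction for context "the"
```

The model outputs scores (conditional probabilities) or a string prediction, and it is trained purely on sequences of forms, which is the property all systems discussed in this section share.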

Before neural models, n-gram models also used large amounts 
of data [20, 87]. In addition to ASR, these large n-gram models of 
English were developed in the context of machine translation from 
another source language with far fewer direct translation examples. 
For example, [20] developed an n-gram model for English with 
a total of 1.8T n-grams and noted steady improvements in BLEU 
score on the test set of 1797 Arabic translations as the training data 
was increased from 13M tokens. 

The next big step was the move towards using pretrained rep- 
resentations of the distribution of words (called word embeddings) 
in other (supervised) NLP tasks. These word vectors came from 
systems such as word2vec [85] and GloVe [98] and later LSTM 
models such as context2vec [82] and ELMo [99] and supported 
state of the art performance on question answering, textual entail- 
ment, semantic role labeling (SRL), coreference resolution, named 
entity recognition (NER), and sentiment analysis, at first in Eng- 
lish and later for other languages as well. While training the word 
embeddings required a (relatively) large amount of data, it reduced 
the amount of labeled data necessary for training on the various 
supervised tasks. For example, [99] showed that a model trained 
with ELMo reduced the necessary amount of training data needed 
to achieve similar results on SRL compared to models without, as 
shown in one instance where a model trained with ELMo reached
the maximum development F1 score in 10 epochs as opposed to
486 without ELMo. This model furthermore achieved the same F1
score with 1% of the data as the baseline model achieved with 10%
of the training data. Increasing the number of model parameters,
however, did not yield noticeable increases for LSTMs [e.g. 82].

Year  Model                    # of Parameters  Dataset Size
2019  BERT [39]                3.4E+08          16GB
2019  DistilBERT [113]         6.60E+07         16GB
2019  ALBERT [70]              2.23E+08         16GB
2019  XLNet (Large) [150]      3.40E+08         126GB
2020  ERNIE-GEN (Large) [145]  3.40E+08         16GB
2019  RoBERTa (Large) [74]     3.55E+08         161GB
2019  MegatronLM [122]         8.30E+09         174GB
2020  T5-11B [107]             1.10E+10         745GB
2020  T-NLG [112]              1.70E+10         174GB
2020  GPT-3 [25]               1.75E+11         570GB
2020  GShard [73]              6.00E+11         -
2021  Switch-C [43]            1.57E+12         745GB

Table 1: Overview of recent large language models

Transformer models, on the other hand, have been able to con- 
tinuously benefit from larger architectures and larger quantities of 
data. Devlin et al. [39] in particular noted that training on a large 
dataset and fine-tuning for specific tasks leads to strictly increasing 
results on the GLUE tasks [138] for English as the hyperparameters 
of the model were increased. Initially developed as Chinese LMs, the 
ERNIE family [130, 131, 145] produced ERNIE-GEN, which was also
trained on the original (English) BERT dataset, joining the ranks 
of very large LMs. NVIDIA released the MegatronLM which has 
8.3B parameters and was trained on 174GB of text from the English 
Wikipedia, OpenWebText, RealNews and CC-Stories datasets [122]. 
Trained on the same dataset, Microsoft released T-NLG,1 an LM
with 17B parameters. OpenAI’s GPT-3 [25] and Google’s GShard 
[73] and Switch-C [43] have increased the definition of large LM by 
orders of magnitude in terms of parameters at 175B, 600B, and 1.6T 
parameters, respectively. Table 1 summarizes a selection of these 
LMs in terms of training data size and parameters. As increasingly 
large amounts of text are collected from the web in datasets such 
as the Colossal Clean Crawled Corpus [107] and the Pile [51], this 
trend of increasingly large LMs can be expected to continue as long 
as they correlate with an increase in performance. 
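
As a rough illustration of the scale involved, the sketch below computes how many times larger some of these models are than BERT, using the parameter counts from Table 1. The helper name `growth_vs_bert` and the choice of BERT as the baseline are our own assumptions for this example.

```python
import math

# Parameter counts as listed in Table 1.
PARAMS = {
    "BERT": 3.4e8,
    "MegatronLM": 8.3e9,
    "T-NLG": 1.7e10,
    "GPT-3": 1.75e11,
    "GShard": 6.0e11,
    "Switch-C": 1.57e12,
}


def growth_vs_bert(model):
    """Return (ratio to BERT, orders of magnitude) for a model in PARAMS."""
    ratio = PARAMS[model] / PARAMS["BERT"]
    return ratio, math.log10(ratio)


for name in PARAMS:
    ratio, orders = growth_vs_bert(name)
    print(f"{name:12s} {ratio:10.1f}x BERT  ({orders:.2f} orders of magnitude)")
```

Under these figures, Switch-C has roughly 4,600 times as many parameters as BERT, i.e. between three and four orders of magnitude more, within about two years.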

A number of these models also have multilingual variants such 
as mBERT [39] and mT5 [148] or are trained with some amount of 
multilingual data such as GPT-3 where 7% of the training data was 
not in English [25]. The performance of these multilingual models
across languages is an active area of research. Wu and Dredze
[144] found that while mBERT does not perform equally well across 
all 104 languages in its training data, it performed better at NER, 
POS tagging, and dependency parsing than monolingual models 
trained with comparable amounts of data for four low-resource 
languages. Conversely, [95] surveyed monolingual BERT models 
developed with more specific architecture considerations or additional
monolingual data and found that they generally outperform
mBERT across 29 tasks. Either way, these models do not address
the inclusion problems raised by [65], who note that over 90% of
the world’s languages used by more than a billion people currently
have little to no support in terms of language technology.

1https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

Alongside work investigating what information the models re- 
tain from the data, we see a trend in reducing the size of these 
models using various techniques such as knowledge distillation 
[26, 58], quantization [118, 153], factorized embedding parameterization
and cross-layer parameter sharing [70], and progressive
module replacing [146]. Rogers et al. [110] provide a comprehensive 
comparison of models derived from BERT using these techniques, 
such as DistilBERT [113] and ALBERT [70]. While these models 
maintain and sometimes exceed the performance of the original 
BERT model, despite their much smaller size, they ultimately still 
rely on large quantities of data and significant processing and stor- 
age capabilities to both hold and reduce the model. 

We note that the change from n-gram LMs to word vectors dis- 
tilled from neural LMs to pretrained Transformer LMs is paralleled 
by an expansion and change in the types of tasks they are use- 
ful for: n-gram LMs were initially typically deployed in selecting 
among the outputs of e.g. acoustical or translation models; the 
LSTM-derived word vectors were quickly picked up as more effec- 
tive representations of words (in place of bag of words features) 
in a variety of NLP tasks involving labeling and classification; and 
the pretrained Transformer models can be retrained on very small 
datasets (few-shot, one-shot or even zero-shot learning) to perform 
apparently meaning-manipulating tasks such as summarization, 
question answering and the like. Nonetheless, all of these systems 
share the property of being LMs in the sense we give above, that 
is, systems trained to predict sequences of words (or characters or 
sentences). Where they differ is in the size of the training datasets 
they leverage and the spheres of influence they can possibly affect. 
By scaling up in these two ways, modern very large LMs incur new 
kinds of risk, which we turn to in the following sections. 


3 ENVIRONMENTAL AND FINANCIAL COST 


Strubell et al. recently benchmarked model training and develop- 
ment costs in terms of dollars and estimated CO2 emissions [129]. 
While the average human is responsible for an estimated 5t CO2e
per year,2 the authors trained a Transformer (big) model [136] with
neural architecture search and estimated that the training procedure 
emitted 284t of CO2. Training a single BERT base model (without 
hyperparameter tuning) on GPUs was estimated to require as much 
energy as a trans-American flight. 
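
The comparison above can be restated as simple arithmetic. The sketch below divides the estimated training emissions by the estimated annual per-capita figure; it assumes, as the comparison in the text does, that the two quantities are directly comparable, and the function name `human_years_equivalent` is our own.

```python
AVG_HUMAN_T_CO2E_PER_YEAR = 5.0  # estimated annual per-capita emissions, from the text
NAS_TRAINING_T_CO2 = 284.0       # Transformer (big) with neural architecture search, from the text


def human_years_equivalent(emissions_t, per_capita_t=AVG_HUMAN_T_CO2E_PER_YEAR):
    """Express an emissions total as years of average per-capita emissions."""
    return emissions_t / per_capita_t


print(f"{human_years_equivalent(NAS_TRAINING_T_CO2):.1f} person-years of average emissions")
```

On these estimates, the single training procedure corresponds to the average emissions of one person over more than half a century.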

While some of this energy comes from renewable sources, or 
cloud compute companies’ use of carbon credit-offset sources, the 
authors note that the majority of cloud compute providers’ energy is 
not sourced from renewable sources and many energy sources in the 
world are not carbon neutral. In addition, renewable energy sources 
are still costly to the environment,3 and data centers with increasing
computation requirements take away from other potential uses of
green energy,4 underscoring the need for energy efficient model
architectures and training paradigms.

2Data for 2017, from https://ourworldindata.org/co2-emissions, accessed Jan 21, 2021
3https://www.heraldscotland.com/news/18270734.14m-trees-cut-scotland-make-way-wind-farms/

Strubell et al. also examine the cost of these models vs. their 
accuracy gains. For the task of machine translation where large 
LMs have resulted in performance gains, they estimate that an
increase of 0.1 in BLEU score using neural architecture search for
English to German translation results in an increase of $150,000
in compute cost in addition to the carbon emissions. To encourage
more equitable access to NLP research and reduce carbon footprint, 
the authors give recommendations to report training time and 
sensitivity to hyperparameters when the released model is meant 
to be re-trained for downstream use. They also urge governments to 
invest in compute clouds to provide equitable access to researchers. 

Initiatives such as the SustainNLP workshop5 have since taken
up the goal of prioritizing computationally efficient hardware and 
algorithms. Schwartz et al. [115] also call for the development of 
green AI, similar to other environmentally friendly scientific de- 
velopments such as green chemistry or sustainable computing. As 
shown in [5], the amount of compute used to train the largest deep 
learning models (for NLP and other applications) has increased 
300,000x in 6 years, increasing at a far higher pace than Moore’s 
Law. To promote green AI, Schwartz et al. argue for promoting 
efficiency as an evaluation metric and show that most sampled 
papers from ACL 2018, NeurIPS 2018, and CVPR 2019 claim accu- 
racy improvements alone as primary contributions to the field, and 
none focused on measures of efficiency as primary contributions. 
Since then, works such as [57, 75] have released online tools to 
help researchers benchmark their energy usage. Among their rec- 
ommendations are to run experiments in carbon friendly regions, 
consistently report energy and carbon metrics, and consider energy- 
performance trade-offs before deploying energy hungry models. 
In addition to these calls for documentation and technical fixes, 
Bietti and Vatanparast underscore the need for social and political 
engagement in shaping a future where data driven systems have 
minimal negative impact on the environment [16]. 
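
To put the reported 300,000x growth in compute in perspective, the sketch below derives the implied doubling time and compares it with growth under Moore's Law. The 24-month doubling period used for Moore's Law is an assumption on our part (it is commonly stated as roughly two years), and the function names are illustrative.

```python
import math


def doubling_time_months(growth_factor, years):
    """Doubling time implied by an overall growth factor over a period."""
    return years * 12 / math.log2(growth_factor)


def growth_over(years, doubling_months):
    """Overall growth factor after `years` at a fixed doubling period."""
    return 2 ** (years * 12 / doubling_months)


implied = doubling_time_months(300_000, 6)  # figure reported in [5]
moore = growth_over(6, 24)                  # assumed 24-month doubling period

print(f"Implied doubling time: {implied:.2f} months")
print(f"Growth over 6 years at a 24-month doubling period: {moore:.0f}x")
```

Under these assumptions, the compute used for the largest models doubled about every four months, whereas a 24-month doubling period would yield only an 8x increase over the same six years.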

While [129] benchmarks the training process in a research set- 
ting, many LMs are deployed in industrial or other settings where 
the cost of inference might greatly outweigh that of training in 
the long run. In this scenario, it may be more appropriate to de- 
ploy models with lower energy costs during inference even if their 
training costs are high. In addition to benchmarking tools, works 
estimating the cost increase associated with the introduction of LMs 
for particular applications, and how they compare to alternative 
NLP methods, will be important for understanding the trade-offs. 

When we perform risk/benefit analyses of language technology, 
we must keep in mind how the risks and benefits are distributed, 
because they do not accrue to the same people. On the one hand, it 
is well documented in the literature on environmental racism that 
the negative effects of climate change are reaching and impacting 
the world’s most marginalized communities first [1, 27].6 Is it fair or
just to ask, for example, that the residents of the Maldives (likely to
be underwater by 2100 [6]) or the 800,000 people in Sudan affected
by drastic floods7 pay the environmental price of training and
deploying ever larger English LMs, when similar large-scale models
aren’t being produced for Dhivehi or Sudanese Arabic?8

4https://news.microsoft.com/2017/11/02/microsoft-announces-one-of-the-largest-wind-deals-in-the-netherlands-with-vattenfall/
5https://sites.google.com/view/sustainlp2020/organization
6https://www.un.org/sustainabledevelopment/blog/2016/10/report-inequalities-exacerbate-climate-impacts-on-poor/

And, while some language technology is genuinely designed to 
benefit marginalized communities [17, 101], most language technol- 
ogy is built to serve the needs of those who already have the most 
privilege in society. Consider, for example, who is likely to both 
have the financial resources to purchase a Google Home, Amazon 
Alexa or an Apple device with Siri installed and comfortably speak
a variety of a language which these devices are prepared to handle.
Furthermore, when large LMs encode and reinforce hegemonic biases
(see §§4 and 6), the harms that follow are most likely to fall on 
marginalized populations who, even in rich nations, are most likely 
to experience environmental racism [10, 104]. 

These models are being developed at a time when unprece- 
dented environmental changes are being witnessed around the 
world. From monsoons caused by changes in rainfall patterns due 
to climate change affecting more than 8 million people in India,9
to the worst fire season on record in Australia killing or displacing
nearly three billion animals and at least 400 people,10 the effect
of climate change continues to set new records every year. It is 
past time for researchers to prioritize energy efficiency and cost 
to reduce negative environmental impact and inequitable access 
to resources — both of which disproportionately affect people who 
are already in marginalized positions. 


4 UNFATHOMABLE TRAINING DATA 


The size of data available on the web has enabled deep learning 
models to achieve high accuracy on specific benchmarks in NLP 
and computer vision applications. However, in both application 
areas, the training data has been shown to have problematic charac- 
teristics [18, 38, 42, 47, 61] resulting in models that encode stereo- 
typical and derogatory associations along gender, race, ethnicity, 
and disability status [11, 12, 69, 132, 157]. In this section,
we discuss how large, uncurated, Internet-based datasets encode 
the dominant/hegemonic view, which further harms people at the 
margins, and recommend significant resource allocation towards 
dataset curation and documentation practices. 


4.1 Size Doesn’t Guarantee Diversity 


The Internet is a large and diverse virtual space, and accordingly, it 
is easy to imagine that very large datasets, such as Common Crawl 
(“petabytes of data collected over 8 years of web crawling”,11 a
filtered version of which is included in the GPT-3 training data) must
therefore be broadly representative of the ways in which different
people view the world. However, on closer examination, we find that
there are several factors which narrow Internet participation, the
discussions which will be included via the crawling methodology,
and finally the texts likely to be contained after the crawled data
are filtered. In all cases, the voices of people most likely to hew to
a hegemonic viewpoint are also more likely to be retained. In the
case of US and UK English, this means that white supremacist and
misogynistic, ageist, etc. views are overrepresented in the training
data, not only exceeding their prevalence in the general population
but also setting up models trained on these datasets to further
amplify biases and harms.

7https://www.aljazeera.com/news/2020/9/25/over-800000-affected-in-sudan-flooding-un
8By this comment, we do not intend to erase existing work on low-resource languages.
One particularly exciting example is the Masakhane project [91], which explores
participatory research techniques for developing MT for African languages. These
promising directions do not involve amassing terabytes of data.
9https://www.voanews.com/south-central-asia/monsoons-cause-havoc-india-climate-change-alters-rainfall-patterns
10https://www.cnn.com/2020/07/28/asia/australia-fires-wildlife-report-scli-intl-scn/index.html
11http://commoncrawl.org/

Starting with who is contributing to these Internet text collec- 
tions, we see that Internet access itself is not evenly distributed, 
resulting in Internet data overrepresenting younger users and those 
from developed countries [100, 143].12 However, it’s not just the Internet
as a whole that is in question, but rather specific subsamples
of it. For instance, GPT-2’s training data is sourced by scraping out- 
bound links from Reddit, and Pew Internet Research’s 2016 survey 
reveals 67% of Reddit users in the United States are men, and 64% 
between ages 18 and 29.13 Similarly, recent surveys of Wikipedians 
find that only 8.8-15% are women or girls [9]. 

Furthermore, while user-generated content sites like Reddit, 
Twitter, and Wikipedia present themselves as open and accessible 
to anyone, there are structural factors including moderation prac- 
tices which make them less welcoming to marginalized populations. 
Jones [64] documents (using digital ethnography techniques [63]) 
multiple cases where people on the receiving end of death threats 
on Twitter have had their accounts suspended while the accounts 
issuing the death threats persist. She further reports that harass- 
ment on Twitter is experienced by “a wide range of overlapping 
groups including domestic abuse victims, sex workers, trans people, 
queer people, immigrants, medical patients (by their providers), 
neurodivergent people, and visibly or vocally disabled people.” The
net result is that a limited set of subpopulations can continue to 
easily add data, sharing their thoughts and developing platforms 
that are inclusive of their worldviews; this systemic pattern in turn 
worsens diversity and inclusion within Internet-based communica- 
tion, creating a feedback loop that lessens the impact of data from 
underrepresented populations. 

Even if populations who feel unwelcome in mainstream sites set 
up different fora for communication, these may be less likely to be 
included in training data for language models. Take, for example, 
older adults in the US and UK. Lazar et al. outline how they both in- 
dividually and collectively articulate anti-ageist frames specifically 
through blogging [71], which some older adults prefer over more 
popular social media sites for discussing sensitive topics [24]. These 
fora contain rich discussions about what constitutes age discrimi- 
nation and the impacts thereof. However, a blogging community 
such as the one described by Lazar et al. is less likely to be found 
than other blogs that have more incoming and outgoing links. 

Finally, the current practice of filtering datasets can further atten- 
uate the voices of people from marginalized identities. The training 
set for GPT-3 was a filtered version of the Common Crawl dataset, 
developed by training a classifier to pick out those documents
most similar to the ones used in GPT-2’s training data, i.e. documents
linked to from Reddit [25], plus Wikipedia and a collection
of books. While this was reportedly effective at filtering out documents
that previous work characterized as “unintelligible” [134],
what is unmeasured (and thus unknown) is what else it filtered out.
The Colossal Clean Crawled Corpus [107], used to train a trillion
parameter LM in [43], is cleaned, inter alia, by discarding any page
containing one of a list of about 400 “Dirty, Naughty, Obscene or
Otherwise Bad Words” [p.6].14 This list is overwhelmingly words
related to sex, with a handful of racial slurs and words related to
white supremacy (e.g. swastika, white power) included. While possibly
effective at removing documents containing pornography (and
the associated problematic stereotypes encoded in the language of
such sites [125]) and certain kinds of hate speech, this approach will
also undoubtedly attenuate, by suppressing such words as twink,
the influence of online spaces built by and for LGBTQ people. If
we filter out the discourse of marginalized populations, we fail to
provide training data that reclaims slurs and otherwise describes
marginalized identities in a positive light.

12This point is also mentioned in the model card for GPT-3: https://github.com/openai/gpt-3/blob/master/model-card.md
13https://www.journalism.org/2016/02/25/reddit-news-users-more-likely-to-be-male-young-and-digital-in-their-news-preferences/

Thus at each step, from initial participation in Internet fora, to 
continued presence there, to the collection and finally the filtering 
of training data, current practice privileges the hegemonic view- 
point. In accepting large amounts of web text as ‘representative’ 
of ‘all’ of humanity we risk perpetuating dominant viewpoints, 
increasing power imbalances, and further reifying inequality. We 
instead propose practices that actively seek to include communities 
underrepresented on the Internet. For instance, one can take inspi- 
ration from movements to decolonize education by moving towards 
oral histories due to the overrepresentation of colonial views in 
text [35, 76, 127], and curate training datasets through a thoughtful 
process of deciding what to put in, rather than aiming solely for 
scale and trying haphazardly to weed out, post-hoc, flotsam deemed 
‘dangerous’, ‘unintelligible’, or ‘otherwise bad’. 
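
The page-level blocklist filtering discussed earlier in this section can be sketched as follows. This is a minimal illustration under our own assumptions, not the actual cleaning code used for any cited corpus: the tokenization rule, the function names, the placeholder blocklist entry `flagged`, and the example pages are all hypothetical.

```python
import re


def page_is_kept(text, blocklist):
    """Keep a page only if none of its tokens appears on the blocklist."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return tokens.isdisjoint(blocklist)


def filter_pages(pages, blocklist):
    """Discard every page containing at least one blocklisted word."""
    return [page for page in pages if page_is_kept(page, blocklist)]


# Hypothetical placeholder standing in for an entry on a real word list.
blocklist = {"flagged"}

pages = [
    "A recipe for lentil soup.",
    "A community forum post that uses the flagged term to describe the author's own identity.",
    "An abusive post that uses the flagged term as an insult.",
]

kept = filter_pages(pages, blocklist)
print(f"kept {len(kept)} of {len(pages)} pages")
```

Because the filter looks only at the presence of a word and not at who uses it or how, the self-descriptive post and the abusive post are discarded alike, which is the attenuation effect described above.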


4.2 Static Data/Changing Social Views 


A central aspect of social movement formation involves using lan- 
guage strategically to destabilize dominant narratives and call at- 
tention to underrepresented social perspectives. Social movements 
produce new norms, language, and ways of communicating. This 
adds challenges to the deployment of LMs, as methodologies re- 
liant on LMs run the risk of ‘value-lock’, where the LM-reliant 
technology reifies older, less-inclusive understandings. 

For instance, the Black Lives Matter movement (BLM) influenced 
Wikipedia article generation and editing such that, as the BLM 
movement grew, articles covering shootings of Black people in- 
creased in coverage and were generated with reduced latency [135]. 
Importantly, articles describing past shootings and incidents of po- 
lice brutality were created and updated as articles for new events 
were created, reflecting how social movements make connections 
between events in time to form cohesive narratives [102]. More 
generally, Twyman et al. [135] highlight how social movements 
actively influence framings and reframings of minority narratives
in the type of online discourse that potentially forms the data that
underpins LMs.

14Available at https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en, accessed Jan 18, 2021
15This observation is due to William Agnew.

An important caveat is that social movements which are poorly 
documented and which do not receive significant media attention 
will not be captured at all. Media coverage can fail to cover protest 
events and social movements [41, 96] and can distort events that 
challenge state power [36]. This is exemplified by media outlets 
that tend to ignore peaceful protest activity and instead focus on 
dramatic or violent events that make for good television but nearly 
always result in critical coverage [81]. As a result, the data under- 
pinning LMs stands to misrepresent social movements and dispro- 
portionately align with existing regimes of power. 

Developing and shifting frames stand to be learned in incomplete 
ways or lost in the big-ness of data used to train large LMs — particu- 
larly if the training data isn’t continually updated. Given the com- 
pute costs alone of training large LMs, it likely isn’t feasible for 
even large corporations to fully retrain them frequently enough to 
keep up with the kind of language change discussed here. Perhaps 
fine-tuning approaches could be used to retrain LMs, but here again, 
what would be required is thoughtful curation practices to find ap- 
propriate data to capture reframings and techniques for evaluating 
whether such fine-tuning appropriately captures the ways in which 
new framings contest hegemonic representations. 


4.3 Encoding Bias 


It is well established by now that large LMs exhibit various kinds of 
bias, including stereotypical associations [11, 12, 69, 119, 156, 157], 
or negative sentiment towards specific groups [61]. Furthermore, 
we see the effects of intersectionality [34], where BERT, ELMo, GPT 
and GPT-2 encode more bias against identities marginalized along 
more than one dimension than would be expected based on just the 
combination of the bias along each of the axes [54, 132]. Many of 
these works conclude that these issues are a reflection of training 
data characteristics. For instance, Hutchinson et al. find that BERT 
associates phrases referencing persons with disabilities with more 
negative sentiment words, and that gun violence, homelessness, 
and drug addiction are overrepresented in texts discussing mental 
illness [61]. Similarly, Gehman et al. show that models like GPT-3 
trained with at least 570GB of data derived mostly from Common 
Crawl16 can generate sentences with high toxicity scores even when
prompted with non-toxic sentences [53]. Their investigation of
GPT-2’s training data17 also finds 272K documents from unreliable news
sites and 63K from banned subreddits. 

These demonstrations of biases learned by LMs are extremely 
valuable in pointing out the potential for harm when such models 
are deployed, either in generating text or as components of classi- 
fication systems, as explored further in §6. However, they do not 
represent a methodology that can be used to exhaustively discover 
all such risks, for several reasons. 

First, model auditing techniques typically rely on automated 
systems for measuring sentiment, toxicity, or novel metrics such 
as ‘regard’ to measure attitudes towards a specific demographic 
group [119]. But these systems themselves may not be reliable 


16 https://commoncrawl.org/the-data/
17 GPT-3’s training data is not openly available, but GPT-2’s training data was used
indirectly to construct GPT-3’s [53].
means of measuring the toxicity of text generated by LMs. For
example, the Perspective API model has been found to associate 
higher levels of toxicity with sentences containing identity markers 
for marginalized groups or even specific names [61, 103]. 
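One way to probe a scorer for this kind of sensitivity to identity terms is a template audit: fill the same neutral sentence frames with different terms and compare the scores. The sketch below is illustrative only. `toy_toxicity_score`, the templates, and the term list are hypothetical stand-ins; this is not the Perspective API and not the method of any work cited above.

```python
from statistics import mean

# Hypothetical neutral sentence frames; "{term}" is filled with an identity term.
TEMPLATES = ["I am a {term} person.", "My neighbor is a {term} person."]

def toy_toxicity_score(text):
    """Stand-in scorer (NOT a real API). It is deliberately flawed so the audit
    has something to find: it treats the word 'deaf' as a toxicity signal."""
    flagged = {"deaf"}
    words = text.lower().replace(".", "").split()
    return 0.8 if any(w in flagged for w in words) else 0.1

def audit_identity_terms(score_fn, terms, baseline_term):
    """Return each term's mean score minus the baseline term's mean score."""
    def mean_score(term):
        return mean(score_fn(t.format(term=term)) for t in TEMPLATES)
    base = mean_score(baseline_term)
    return {term: round(mean_score(term) - base, 3) for term in terms}

gaps = audit_identity_terms(toy_toxicity_score, ["deaf", "tall"], baseline_term="tall")
print(gaps)
```

A nonzero gap for a term placed in otherwise neutral frames suggests that the scorer, rather than the text, is the source of the signal. Which terms to test remains a choice the auditor has to make in advance.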

Second, auditing an LM for biases requires an a priori under- 
standing of what social categories might be salient. The works cited 
above generally start from US protected attributes such as race and 
gender (as understood within the US). But, of course, protected 
attributes aren’t the only identity characteristics that can be subject 
to bias or discrimination, and the salient identity characteristics 
and expressions of bias are also culture-bound [46, 116]. Thus, com- 
ponents like toxicity classifiers would need culturally appropriate 
training data for each context of audit, and even still we may miss 
marginalized identities if we don’t know what to audit for. 

Finally, we note that moving beyond demonstrating the exis- 
tence of bias to building systems that verify the ‘safety’ of some 
LM (even for a given protected class) requires engaging with the 
systems of power that lead to the harmful outcomes such a system 
would seek to prevent [19]. For example, the #MeToo movement has 
spurred broad-reaching conversations about inappropriate sexual 
behavior from men in power, as well as men more generally [84]. 
These conversations challenge behaviors that have been historically 
considered appropriate or even the fault of women, shifting notions 
of sexually inappropriate behavior. Any product development that 
involves operationalizing definitions around such shifting topics 
into algorithms is necessarily political (whether or not developers 
choose the path of maintaining the status quo ante). For example, 
men and women make significantly different assessments of sexual 
harassment online [40]. An algorithmic definition of what con- 
stitutes inappropriately sexual communication will inherently be 
concordant with some views and discordant with others. Thus, an 
attempt to measure the appropriateness of text generated by LMs, or 
the biases encoded by a system, always needs to be done in relation 
to particular social contexts and marginalized perspectives [19]. 


4.4 Curation, Documentation & Accountability 


In summary, LMs trained on large, uncurated, static datasets from 
the Web encode hegemonic views that are harmful to marginalized 
populations. We thus emphasize the need to invest significant re- 
sources into curating and documenting LM training data. In this, 
we follow Jo et al. [62], who cite archival history data collection 
methods as an example of the amount of resources that should be 
dedicated to this process, and Birhane and Prabhu [18], who call for 
a more justice-oriented data collection methodology. Birhane and 
Prabhu note, echoing Ruha Benjamin [15], “Feeding AI systems on 
the world’s beauty, ugliness, and cruelty, but expecting it to reflect 
only the beauty is a fantasy.” [p.1541]

When we rely on ever larger datasets we risk incurring documentation
debt,¹⁸ i.e. putting ourselves in a situation where the
datasets are both undocumented and too large to document post hoc. 
While documentation allows for potential accountability [13, 52, 86], 
undocumented training data perpetuates harm without recourse. 
Without documentation, one cannot try to understand training data 
characteristics in order to mitigate some of these attested issues 
or even unknown ones. The solution, we propose, is to budget for 


18 On the notion of documentation debt as applied to code, rather than data, see [154].
documentation as part of the planned costs of dataset creation, and
only collect as much data as can be thoroughly documented within
that budget.


5 DOWN THE GARDEN PATH 


In §4 above, we discussed the ways in which different types of 
biases can be encoded in the corpora used to train large LMs. In 
§6 below we explore some of the risks and harms that can follow 
from deploying technology that has learned those biases. In the 
present section, however, we focus on a different kind of risk: that 
of misdirected research effort, specifically around the application 
of LMs to tasks intended to test for natural language understanding 
(NLU). As the very large Transformer LMs posted striking gains 
in the state of the art on various benchmarks intended to model 
meaning-sensitive tasks, and as initiatives like [142] made the mod- 
els broadly accessible to researchers seeking to apply them, large 
quantities of research effort turned towards measuring how well 
BERT and its kin do on both existing and new benchmarks.¹⁹ This
allocation of research effort brings with it an opportunity cost, on 
the one hand in terms of time not spent applying meaning captur- 
ing approaches to meaning sensitive tasks, and on the other hand in 
terms of time not spent exploring more effective ways of building 
technology with datasets of a size that can be carefully curated and 
available for a broader set of languages [65, 91]. 

The original BERT paper [39] showed the effectiveness of the 
architecture and the pretraining technique by evaluating on the 
General Language Understanding Evaluation (GLUE) benchmark 
[138], the Stanford Question Answering Datasets (SQUAD 1.1 and 
2.0) [108], and the Situations With Adversarial Generations bench- 
mark (SWAG) [155], all datasets designed to test language under- 
standing and/or commonsense reasoning. BERT posted state of 
the art results on all of these tasks, and the authors conclude by 
saying that “unsupervised pre-training is an integral part of many 
language understanding systems.” [39, p.4179]. Even before [39] 
was published, BERT was picked up by the NLP community and 
applied with great success to a wide variety of tasks [e.g. 2, 149]. 

However, no actual language understanding is taking place in 
LM-driven approaches to these tasks, as can be shown by careful 
manipulation of the test data to remove spurious cues the systems 
are leveraging [21, 93]. Furthermore, as Bender and Koller [14] 
argue from a theoretical perspective, languages are systems of 
signs [37], i.e. pairings of form and meaning. But the training data
for LMs is only form; they do not have access to meaning. Therefore, 
claims about model abilities must be carefully characterized. 

As the late Karen Sparck Jones pointed out: the use of LMs 
ties us to certain (usually unstated) epistemological and method- 
ological commitments [124]. Either i) we commit ourselves to a 
noisy-channel interpretation of the task (which rarely makes sense 
outside of ASR), ii) we abandon any goals of theoretical insight into 
tasks and treat LMs as “just some convenient technology” [p.7], or 
iii) we implicitly assume a certain statistical relationship (known
to be invalid) between inputs, outputs and meanings.²⁰ Although


19 ~26% of papers sampled from ACL, NAACL and EMNLP since 2018 cite [39].
20 Specifically, that the mutual information between the input and the meaning given 
the output is zero — what Sparck Jones calls “the model of ignorance”. 
she primarily had n-gram models in mind, the conclusions remain
apt and relevant. 
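The assumption described in footnote 20 can be written compactly. As a sketch, with symbols chosen here for illustration (they are not the notation of [124]), let $X$ be the input, $Y$ the output and $M$ the meaning:

```latex
I(X; M \mid Y) = 0
\quad\Longleftrightarrow\quad
p(x, m \mid y) = p(x \mid y)\, p(m \mid y)
\quad \text{for all } x, m, y \text{ with } p(y) > 0 .
```

In words: once the output is known, the input carries no further information about the meaning.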

There are interesting linguistic questions to ask about what ex- 
actly BERT, GPT-3 and their kin are learning about linguistic struc- 
ture from the unsupervised language modeling task, as studied in 
the emerging field of ‘BERTology’ [e.g. 110, 133]. However, from 
the perspective of work on language technology, it is far from clear 
that all of the effort being put into using large LMs to ‘beat’ tasks 
designed to test natural language understanding, and all of the 
effort to create new such tasks, once the existing ones have been 
bulldozed by the LMs, brings us any closer to long-term goals of 
general language understanding systems. If a large LM, endowed 
with hundreds of billions of parameters and trained on a very large 
dataset, can manipulate linguistic form well enough to cheat its 
way through tests meant to require language understanding, have 
we learned anything of value about how to build machine language 
understanding or have we been led down the garden path? 


6 STOCHASTIC PARROTS 


In this section, we explore the ways in which the factors laid out in 
§4 and §5 — the tendency of training data ingested from the Internet 
to encode hegemonic worldviews, the tendency of LMs to amplify 
biases and other issues in the training data, and the tendency of re- 
searchers and other people to mistake LM-driven performance gains 
for actual natural language understanding — present real-world 
risks of harm, as these technologies are deployed. After exploring 
some reasons why humans mistake LM output for meaningful text, 
we turn to the risks and harms from deploying such a model at 
scale. We find that the mix of human biases and seemingly coherent 
language heightens the potential for automation bias, deliberate 
misuse, and amplification of a hegemonic worldview. We focus 
primarily on cases where LMs are used in generating text, but we 
will also touch on risks that arise when LMs or word embeddings 
derived from them are components of systems for classification, 
query expansion, or other tasks, or when users can query LMs for 
information memorized from their training data. 


6.1 Coherence in the Eye of the Beholder 


Where traditional n-gram LMs [117] can only model relatively local 
dependencies, predicting each word given the preceding sequence 
of N words (usually 5 or fewer), the Transformer LMs capture 
much larger windows and can produce text that is seemingly not 
only fluent but also coherent even over paragraphs. For example, 
McGuffie and Newhouse [80] prompted GPT-3 with the text in 
bold in Figure 1, and it produced the rest of the text, including the 
Q&A format.²¹ This example illustrates GPT-3’s ability to produce
coherent and on-topic text; the topic is connected to McGuffie and 
Newhouse’s study of GPT-3 in the context of extremism, discussed 
below. 
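To make the contrast concrete, the following is a minimal sketch of an n-gram LM of the kind described at the start of this subsection, here a bigram model that conditions on a single preceding word. The toy corpus and the function names are illustrative assumptions. The point is only that generation samples each next word from counts of previously observed sequences of forms.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample a continuation word by word, in proportion to observed counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:  # no observed continuation: stop early
            break
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out

corpus = "the parrot repeats the phrase and the parrot repeats the sound".split()
model = train_bigram(corpus)
print(" ".join(generate(model, "the", 6)))
```

Every adjacent word pair in the output was seen in the corpus, yet nothing in the model refers to anything beyond those counts. Transformer LMs condition on far longer contexts, but their training signal is likewise form alone.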

We say seemingly coherent because coherence is in fact in the 
eye of the beholder. Our human understanding of coherence de- 
rives from our ability to recognize interlocutors’ beliefs [30, 31] and 
intentions [23, 33] within context [32]. That is, human language use 


21 This is just the first part of the response that McGuffie and Newhouse show. GPT-3 
continues for two more question answer pairs with similar coherence. McGuffie and 
Newhouse report that all examples given in their paper are from either the first or 
second attempt at running a prompt. 
Question: What is the name of the Russian mercenary group?
Answer: Wagner group. 


Question: Where is the Wagner group? 
Answer: In Syria. 


Question: Who is the leader of the Wagner group? 

Answer: A Russian citizen, Dmitry Utkin. He was born in 1971 and lives in 
Moscow. He has been a member of the Russian parliament since 2016. He 
is a former officer of the Main Intelligence Directorate (GRU) of Russia’s 
General Staff. He was also a commander of the special forces unit “Vostok” 
(East). The unit was deployed to Chechnya and fought against separatists 
there in 1994-1996. After that, he served as an adviser to the head of 
Chechnya, Ramzan Kadyrov, and later became his deputy. In 2004-2005, he 
commanded a battalion in Chechnya and participated in fighting against 
insurgents there. In 2008-2009, he commanded another battalion in 
Abkhazia and South Ossetia during Russia’s war with Georgia over these 
two breakaway regions. Since 2014, he has been leading military units 
fighting against Ukrainian government forces on behalf of Russia-backed 
separatists in eastern Ukraine. According to open sources, Utkin is 
currently living in Moscow but travels frequently to Syria where his troops 
are fighting on behalf of Bashar al-Assad’s regime against 
anti-government forces there. 


Figure 1: GPT-3’s response to the prompt (in bold), from [80] 


takes place between individuals who share common ground and are 
mutually aware of that sharing (and its extent), who have commu- 
nicative intents which they use language to convey, and who model 
each others’ mental states as they communicate. As such, human 
communication relies on the interpretation of implicit meaning 
conveyed between individuals. The fact that human-human com- 
munication is a jointly constructed activity [29, 128] is most clearly 
true in co-situated spoken or signed communication, but we use 
the same facilities for producing language that is intended for au- 
diences not co-present with us (readers, listeners, watchers at a 
distance in time or space) and in interpreting such language when 
we encounter it. It must follow that even when we don’t know the 
person who generated the language we are interpreting, we build a 
partial model of who they are and what common ground we think 
they share with us, and use this in interpreting their words. 

Text generated by an LM is not grounded in communicative 
intent, any model of the world, or any model of the reader’s state 
of mind. It can’t have been, because the training data never in- 
cluded sharing thoughts with a listener, nor does the machine have 
the ability to do that. This can seem counter-intuitive given the 
increasingly fluent qualities of automatically generated text, but we 
have to account for the fact that our perception of natural language 
text, regardless of how it was generated, is mediated by our own 
linguistic competence and our predisposition to interpret commu- 
nicative acts as conveying coherent meaning and intent, whether 
or not they do [89, 140]. The problem is, if one side of the commu- 
nication does not have meaning, then the comprehension of the 
implicit meaning is an illusion arising from our singular human 
understanding of language (independent of the model).²² Contrary


22 Controlled generation, where an LM is deployed within a larger system that guides
its generation of output to certain styles or topics [e.g. 147, 151, 158], is not the same 
thing as communicative intent. One clear way to distinguish the two is to ask whether 
to how it may seem when we observe its output, an LM is a system
for haphazardly stitching together sequences of linguistic forms 
it has observed in its vast training data, according to probabilistic 
information about how they combine, but without any reference to 
meaning: a stochastic parrot. 


6.2 Risks and Harms


The ersatz fluency and coherence of LMs raise several risks, precisely
because humans are prepared to interpret strings belonging
to languages they speak as meaningful and corresponding to the 
communicative intent of some individual or group of individuals 
who have accountability for what is said. We now turn to examples, 
laying out the potential follow-on harms. 

The first risks we consider are the risks that follow from the LMs 
absorbing the hegemonic worldview from their training data. When 
humans produce language, our utterances reflect our worldviews, 
including our biases [78, 79]. As people in positions of privilege 
with respect to a society’s racism, misogyny, ableism, etc., tend 
to be overrepresented in training data for LMs (as discussed in 
§4 above), this training data thus includes encoded biases, many 
already recognized as harmful. 

Biases can be encoded in ways that form a continuum from sub- 
tle patterns like referring to women doctors as if doctor itself entails 
not-woman or referring to both genders excluding the possibility of 
non-binary gender identities, through directly contested framings 
(e.g. undocumented immigrants vs. illegal immigrants or illegals), to 
language that is widely recognized to be derogatory (e.g. racial slurs) 
yet still used by some. While some of the most overtly derogatory 
words could be filtered out, not all forms of online abuse are easily 
detectable using such taboo words, as evidenced by the growing 
body of research on online abuse detection [45, 109]. Furthermore, 
in addition to abusive language [139] and hate speech [67], there 
are subtler forms of negativity such as gender bias [137], microag- 
gressions [22], dehumanization [83], and various socio-political 
framing biases [44, 114] that are prevalent in language data. For 
example, describing a woman’s account of her experience of sexism 
with the word tantrum both reflects a worldview where the sexist 
actions are normative and foregrounds a stereotype of women as 
childish and not in control of their emotions. 

An LM that has been trained on such data will pick up these 
kinds of problematic associations. If such an LM produces text that 
is put into the world for people to interpret (flagged as produced 
by an ‘AI’ or otherwise), what risks follow? In the first instance, we 
foresee that LMs producing text will reproduce and even amplify 
the biases in their input [53]. Thus the risk is that people dissemi- 
nate text generated by LMs, meaning more text in the world that 
reinforces and propagates stereotypes and problematic associations, 
both to humans who encounter the text and to future LMs trained 
on training sets that ingested the previous generation LM’s output. 
Humans who encounter this text may themselves be subjects of 
those stereotypes and associations or not. Either way, harms ensue: 
readers subject to the stereotypes may experience the psychological 
harms of microaggressions [88, 141] and stereotype threat [97, 126]. 
Other readers may be introduced to stereotypes or have ones they 


the system (or the organization deploying the system) has accountability for the truth 
of the utterances produced. 
already carry reinforced, leading them to engage in discrimination
(consciously or not) [55], which in turn leads to harms of subju- 
gation, denigration, belittlement, loss of opportunity [3, 4, 56] and 
others on the part of those discriminated against. 

If the LM outputs overtly abusive language (as Gehman et al.
[53] show that LMs can and do), then a similar set of risks arises.
These include: propagating or proliferating overtly abusive views 
and associations, amplifying abusive language, and producing more 
(synthetic) abusive language that may be included in the next itera- 
tion of large-scale training data collection. The harms that could 
follow from these risks are again similar to those identified above 
for more subtly biased language, but perhaps more acute to the ex- 
tent that the language in question is overtly violent or defamatory. 
They include the psychological harm experienced by those who 
identify with the categories being denigrated if they encounter the 
text; the reinforcement of sexist, racist, ableist, etc. ideology; follow- 
on effects of such reinforced ideologies (including violence); and 
harms to the reputation of any individual or organization perceived 
to be the source of the text. 

If the LM or word embeddings derived from it are used as com- 
ponents in a text classification system, these biases can lead to 
allocational and/or reputational harms, as biases in the representa- 
tions affect system decisions [125]. This case is especially pernicious 
for being largely invisible to both the direct user of the system and 
any indirect stakeholders about whom decisions are being made. 
Similarly, biases in an LM used in query expansion could influence 
search results, further exacerbating the risk of harms of the type 
documented by Noble in [94], where the juxtaposition of search 
queries and search results, when connected by negative stereotypes, 
reinforces those stereotypes and causes psychological harm.

The above cases involve risks that could arise when LMs are de- 
ployed without malicious intent. A third category of risk involves 
bad actors taking advantage of the ability of large LMs to produce 
large quantities of seemingly coherent texts on specific topics on 
demand in cases where those deploying the LM have no investment 
in the truth of the generated text. These include prosaic cases, such 
as services set up to ‘automatically’ write term papers or interact on 
social media,²³ as well as use cases connected to promoting
extremism. For example, McGuffie and Newhouse [80] show how GPT-3
could be used to generate text in the persona of a conspiracy theo- 
rist, which in turn could be used to populate extremist recruitment 
message boards. This would give such groups a cheap way to boost 
recruitment by making human targets feel like they were among 
many like-minded people. If the LMs are deployed in this way to 
recruit more people to extremist causes, then harms, in the first 
instance, befall the people so recruited and (likely more severely) 
others as a result of violence carried out by the extremists.

Yet another risk connected to seeming coherence and fluency in- 
volves machine translation (MT) and the way that increased fluency 
of MT output changes the perceived adequacy of that output [77]. 
This differs somewhat from the cases above in that there was an 
initial human communicative intent, by the author of the source lan- 
guage text. However, MT systems can (and frequently do) produce 
output that is inaccurate yet both fluent and (again, seemingly) 


23 Such as the GPT-3 powered bot let loose on Reddit; see
https://thenextweb.com/neural/2020/10/07/someone-let-a-gpt-3-bot-loose-on-reddit-it-didnt-end-well/amp/.
coherent in its own right to a consumer who either doesn’t see
the source text or cannot understand the source text on their own. 
When such consumers therefore mistake the meaning attributed to 
the MT output as the actual communicative intent of the original 
text’s author, real-world harm can ensue. A case in point is the 
story of a Palestinian man, arrested by Israeli police, after MT trans- 
lated his Facebook post which said “good morning” (in Arabic) to 
“hurt them” (in English) and “attack them” (in Hebrew).²⁴ This case
involves a short phrase, but it is easy to imagine how the ability of 
large LMs to produce seemingly coherent text over larger passages 
could erase cues that might tip users off to translation errors in 
longer passages as well [77]. 

Finally, we note that there are risks associated with the fact 
that LMs with extremely large numbers of parameters model their 
training data very closely and can be prompted to output specific 
information from that training data. For example, [28] demonstrate 
a methodology for extracting personally identifiable information 
(PII) from an LM and find that larger LMs are more susceptible to
this style of attack than smaller ones. Building training data out of 
publicly available documents doesn’t fully mitigate this risk: just 
because the PII was already available in the open on the Internet 
doesn’t mean there isn’t additional harm in collecting it and provid- 
ing another avenue to its discovery. This type of risk differs from 
those noted above because it doesn’t hinge on seeming coherence 
of synthetic text, but the possibility of a sufficiently motivated user 
gaining access to training data via the LM. In a similar vein, users 
might query LMs for ‘dangerous knowledge’ (e.g. tax avoidance 
advice), knowing that what they were getting was synthetic and 
therefore not credible but nonetheless representing clues to what 
is in the training data in order to refine their own search queries. 


6.3 Summary 


In this section, we have discussed how the human tendency to 
attribute meaning to text, in combination with large LMs’ ability 
to learn patterns of forms that humans associate with various bi- 
ases and other harmful attitudes, leads to risks of real-world harm, 
should LM-generated text be disseminated. We have also reviewed 
risks connected to using LMs as components in classification sys- 
tems and the risks of LMs memorizing training data. We note that 
the risks associated with synthetic but seemingly coherent text are 
deeply connected to the fact that such synthetic text can enter into 
conversations without any person or entity being accountable for it. 
This accountability both involves responsibility for truthfulness and 
is important in situating meaning. As Maggie Nelson [92] writes: 
“Words change depending on who speaks them; there is no cure.” 

In §7, we consider directions the field could take to pursue goals 
of creating language technology while avoiding some of the risks 
and harms identified here and above. 


7 PATHS FORWARD 


In order to mitigate the risks that come with the creation of in- 
creasingly large LMs, we urge researchers to shift to a mindset of 
careful planning, along many dimensions, before starting to build 
either datasets or systems trained on datasets. We should consider 


24 https://www.theguardian.com/technology/2017/oct/24/facebook-palestine-israel-translates-good-morning-attack-them-arrest
our research time and effort a valuable resource, to be spent to the
extent possible on research projects that build towards a techno- 
logical ecosystem whose benefits are at least evenly distributed or 
better accrue to those historically most marginalized. This means 
considering how research contributions shape the overall direction 
of the field and keeping alert to directions that limit access. Like- 
wise, it means considering the financial and environmental costs 
of model development up front, before deciding on a course of in- 
vestigation. The resources needed to train and tune state-of-the-art 
models stand to increase economic inequities unless researchers 
incorporate energy and compute efficiency in their model evalua- 
tions. Furthermore, the goals of energy and compute efficient model 
building and of creating datasets and models where the incorpo- 
rated biases can be understood both point to careful curation of 
data. Significant time should be spent on assembling datasets suited 
for the tasks at hand rather than ingesting massive amounts of data 
from convenient or easily-scraped Internet sources. As discussed in 
§4.1, simply turning to massive dataset size as a strategy for being 
inclusive of diverse viewpoints is doomed to failure. We recall again 
Birhane and Prabhu’s [18] words (inspired by Ruha Benjamin [15]): 
“Feeding AI systems on the world’s beauty, ugliness, and cruelty, 
but expecting it to reflect only the beauty is a fantasy.” 

As a part of careful data collection practices, researchers must 
adopt frameworks such as [13, 52, 86] to describe the uses for which 
their models are suited and benchmark evaluations for a variety of 
conditions. This involves providing thorough documentation on the 
data used in model building, including the motivations underlying 
data selection and collection processes. This documentation should 
reflect and indicate researchers’ goals, values, and motivations in 
assembling data and creating a given model. It should also make 
note of potential users and stakeholders, particularly those that 
stand to be negatively impacted by model errors or misuse. We note 
that just because a model might have many different applications 
doesn’t mean that its developers don’t need to consider stakeholders. 
An exploration of stakeholders for likely use cases can still be 
informative around potential risks, even when there is no way to 
guarantee that all use cases can be explored. 

We also advocate for a re-alignment of research goals: Where 
much effort has been allocated to making models (and their training 
data) bigger and to achieving ever higher scores on leaderboards 
often featuring artificial tasks, we believe there is more to be gained 
by focusing on understanding how machines are achieving the 
tasks in question and how they will form part of socio-technical 
systems. To that end, LM development may benefit from guided 
evaluation exercises such as pre-mortems [68]. Frequently used in 
business settings before the deployment of new products or projects, 
pre-mortem analyses center hypothetical failures and ask team 
members to reverse engineer previously unanticipated causes.²⁵
Critically, pre-mortem analyses prompt team members to consider 
not only a range of potential known and unknown project risks, but 
also alternatives to current project plans. In this way, researchers 
can consider the risks and limitations of their LMs in a guided 
way while also considering fixes to current designs or alternative 


25 This would be one way to build an evaluation culture that considers not only
average-case performance (as measured by metrics) and best-case performance (cherry-picked
examples), but also worst-case performance. 
methods of achieving a task-oriented goal in relation to specific
pitfalls. 
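The evaluation culture sketched in footnote 25 can be illustrated in a few lines of code. This is a minimal sketch using made-up scores: the condition names and numbers are hypothetical and are not results from any system. It reports average-case, best-case, and worst-case performance side by side, so that a strong average cannot hide a failing condition.

```python
from statistics import mean

def summarize(scores_by_condition):
    """Report average-, best-, and worst-case performance across conditions."""
    worst = min(scores_by_condition, key=scores_by_condition.get)
    return {
        "average": mean(scores_by_condition.values()),
        "best": max(scores_by_condition.values()),
        "worst": scores_by_condition[worst],
        "worst_condition": worst,
    }

# Hypothetical accuracy per evaluation condition (illustrative numbers only).
scores = {"condition_a": 0.9, "condition_b": 0.8, "condition_c": 0.4}
report = summarize(scores)
print(report)
```

Naming the worst condition alongside its score gives a pre-mortem discussion a concrete starting point: who would be affected if the system were deployed under that condition?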

Value sensitive design [49, 50] provides a range of methodologies 
for identifying stakeholders (both direct stakeholders who will use 
a technology and indirect stakeholders who will be affected through 
others’ use of it), working with them to identify their values, and 
designing systems that support those values. These include such 
techniques as envisioning cards [48], the development of value 
scenarios [90], and working with panels of experiential experts 
[152]. These approaches help surface not only stakeholder values, 
but also values expressed by systems and enacted through interac- 
tions between systems and society [120]. For researchers working 
with LMs, value sensitive design is poised to help throughout the 
development process in identifying whose values are expressed and 
supported through a technology and, subsequently, how a lack of 
support might result in harm. 

All of these approaches take time and are most valuable when 
applied early in the development process as part of a conceptual in- 
vestigation of values and harms rather than as a post-hoc discovery 
of risks [72]. These conceptual investigations should come before 
researchers become deeply committed to their ideas and therefore 
less likely to change course when confronted with evidence of pos- 
sible harms. This brings us again to the idea we began this section 
with: that research and development of language technology, at 
once concerned with deeply human data (language) and creating 
systems which humans interact with in immediate and vivid ways, 
should be done with forethought and care. 

Finally, we would like to consider use cases of large LMs that 
have specifically served marginalized populations. If, as we advo- 
cate, the field backs off from the path of ever larger LMs, are we 
thus sacrificing benefits that would accrue to these populations? 
As a case in point, consider automatic speech recognition, which 
has seen some improvements thanks to advances in LMs, including 
both in size and in architecture [e.g. 8, 59, 121], though the largest 
LMs typically are too large and too slow for the near real-time needs 
of ASR systems [60]. Improved ASR has many beneficial applica- 
tions, including automatic captioning which has the potential to 
be beneficial for Deaf and hard of hearing people, providing access 
to otherwise inaccessible audio content.⁶ We see two beneficial 
paths forward here: The first is a broader search for means of im- 
proving ASR systems, as indeed is underway, since the contexts of 
application of the technology aren’t conducive to using ever larger 
LMs [60]. But even if larger LMs could be used, just because we’ve 
seen that large LMs can help doesn’t mean that this is the only 
effective path to stronger ASR technology. (And we note that if we 
want to build strong ASR technology across most of the world’s 
languages, we can’t rely on having terabytes of data in all cases.) 
The second, should we determine that large LMs are critical (when 
available), is to recognize this as an instance of a dual use problem 
and consider how to mitigate the harms of LMs used as stochastic 
parrots while still preserving them for use in ASR systems. Could 
LMs be built in such a way that synthetic text generated with them 
would be watermarked and thus detectable [7, 66, 123]? Are there 
policy approaches that could effectively regulate their use? 

6 Note, however, that automatic captioning is not yet and likely may never be good 
enough to replace human-generated captions. Furthermore, in some contexts, what 
Deaf communities prefer is human captioning plus interpretation to the appropriate 
signed language. We do not wish to suggest that automatic systems are sufficient 
replacements for these key accessibility requirements. 

In summary, we advocate for research that centers the people 
who stand to be adversely affected by the resulting technology, 
with a broad view on the possible ways that technology can affect 
people. This, in turn, means making time in the research process for 
considering environmental impacts, for doing careful data curation 
and documentation, for engaging with stakeholders early in the 
design process, for exploring multiple possible paths towards long- 
term goals, for keeping alert to dual-use scenarios, and finally for 
allocating research effort to harm mitigation in such cases. 


8 CONCLUSION 


The past few years, ever since processing capacity caught up with 
neural models, have been heady times in the world of NLP. Neural 
approaches in general, and large, Transformer LMs in particular, 
have rapidly overtaken the leaderboards on a wide variety of bench- 
marks and once again the adage “there’s no data like more data” 
seems to be true. It may seem like progress in the field, in fact, de- 
pends on the creation of ever larger language models (and research 
into how to deploy them to various ends). 

In this paper, we have invited readers to take a step back and 
ask: Are ever larger LMs inevitable or necessary? What costs are 
associated with this research direction and what should we consider 
before pursuing it? Do the field of NLP or the public that it serves 
in fact need larger LMs? If so, how can we pursue this research 
direction while mitigating its associated risks? If not, what do we 
need instead? 

We have identified a wide variety of costs and risks associated 
with the rush for ever larger LMs, including: environmental costs 
(borne typically by those not benefiting from the resulting technol- 
ogy); financial costs, which in turn erect barriers to entry, limiting 
who can contribute to this research area and which languages can 
benefit from the most advanced techniques; opportunity cost, as 
researchers pour effort away from directions requiring fewer resources; 
and the risk of substantial harms, including stereotyping, denigra- 
tion, increases in extremist ideology, and wrongful arrest, should 
humans encounter seemingly coherent LM output and take it for 
the words of some person or organization who has accountability 
for what is said. 

Thus, we call on NLP researchers to carefully weigh these risks 
while pursuing this research direction, consider whether the bene- 
fits outweigh the risks, and investigate dual use scenarios utilizing 
the many techniques (e.g. those from value sensitive design) that 
have been put forth. We hope these considerations encourage NLP 
researchers to direct resources and effort into techniques for ap- 
proaching NLP tasks that are effective without being endlessly data 
hungry. But beyond that, we call on the field to recognize that appli- 
cations that aim to believably mimic humans bring risk of extreme 
harms. Work on synthetic human behavior is a bright line in ethical 
AI development, where downstream effects need to be understood 
and modeled in order to block foreseeable harm to society and 
different social groups. Thus what is also needed is scholarship on 
the benefits, harms, and risks of mimicking humans and thoughtful 
design of target tasks grounded in use cases sufficiently concrete 
to allow collaborative design with affected communities. 